This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button, a document will be generated that includes both content and the output of any embedded R code chunks within the document.
In this workshop, you will learn how to read quantitative databases (or data sets) in R. These databases have a typical structure composed of rows (or observations), columns (or variables), and values (also called data points). We also say that these values or data points are located in cells, the intersection between a particular row and variable.
In particular, we focus on databases stored in Excel.
Why Excel? Because many scholars, institutions, and organizations store their data in Excel files. Many people also learn Excel before R, and migrate from the former to the latter when learning statistics and data science. It is also not uncommon that our quantitative skills get better while we still struggle to reproduce in R what we used to do in Excel.
From this viewpoint, this workshop is a first step to close the gap between your Excel skills and your R skills in managing data (i.e, data wrangling and pre-processing).
We will focus on three different types of Excel files: .xls, .xlsx, and .csv. We will also cover multiple common scenarios when working with these files.
We will work mainly with the readxl library, created by
Hadley Wickham. He is one of the data scientists behind the
tidyverse, a famous collection of R packages.
We’ll install several packages at once.
# Set the CRAN mirror
options(repos = "https://cran.r-project.org")
install.packages(c("readxl", "writexl", "purrr"))
##
## The downloaded binary packages are in
## /var/folders/ll/khc5jd09519gycf27wrhyzt80000gn/T//Rtmp6pbcnX/downloaded_packages
We’ll load several packages at once by applying lapply.
packages_to_load <- c("readxl", "writexl", "cli", "purrr", "stringr", "stringi")
lapply(packages_to_load, library, character.only=TRUE)
## [[1]]
## [1] "readxl" "stats" "graphics" "grDevices" "utils" "datasets"
## [7] "methods" "base"
##
## [[2]]
## [1] "writexl" "readxl" "stats" "graphics" "grDevices" "utils"
## [7] "datasets" "methods" "base"
##
## [[3]]
## [1] "cli" "writexl" "readxl" "stats" "graphics" "grDevices"
## [7] "utils" "datasets" "methods" "base"
##
## [[4]]
## [1] "purrr" "cli" "writexl" "readxl" "stats" "graphics"
## [7] "grDevices" "utils" "datasets" "methods" "base"
##
## [[5]]
## [1] "stringr" "purrr" "cli" "writexl" "readxl" "stats"
## [7] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## [[6]]
## [1] "stringi" "stringr" "purrr" "cli" "writexl" "readxl"
## [7] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
## [13] "base"
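The character.only = TRUE argument is what lets library() accept package names stored as strings; without it, library() would look for a package literally named after the variable. A minimal sketch using a base package:

```r
# library() normally treats its argument as an unquoted name, so a string
# stored in a variable needs character.only = TRUE
pkg <- "stats"
library(pkg, character.only = TRUE)  # loads the stats package
```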
We will use a set of data frames included in the readxl
package. They are all together in a single Excel file but separated into
different sheets.
We use the readxl_example function to call the specific
data set we want and store it in the object named
xlsx_example.
xlsx_example <- readxl_example("datasets.xlsx")
We can now check the name of the different sheets.
excel_sheets(xlsx_example)
## [1] "iris" "mtcars" "chickwts" "quakes"
So, the first sheet contains the iris data, while the last one contains the quakes data. In total, there are four sheets: iris, mtcars, chickwts, and quakes.
Now, use the read_excel function to open the Excel
file.
read_excel(xlsx_example)
Notice that it opens the first sheet (iris) by default – the function
is using the argument sheet=1. You can use either the sheet
number or its name.
Let’s run the code again but making explicit sheet=1.
Also, save this code in the iris object to keep the data in our Global
Environment.
iris <- read_excel(xlsx_example, sheet=1) #You do not need sheet=1
We can rapidly explore our data set.
iris
First, we can check the structure of the data.
str(iris)
## tibble [150 × 5] (S3: tbl_df/tbl/data.frame)
## $ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : chr [1:150] "setosa" "setosa" "setosa" "setosa" ...
We can also check the variables names.
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
Additionally, we can explore the first values. In this case, we are asking for the first 10 rows.
head(iris , n=10)
However, the important point here is that we can call a specific sheet in an Excel file.
Read and save the chickwts data set from “datasets.xlsx”. Hint:
you can use excel_sheets again to check the sheets located
in the file.
What happens when we want to read multiple sheets at the same time?
We can do so by using purrr functions. The
purrr package enhances functional programming. Check the
following:
“…if you’ve never heard of FP before, the best place to start is the family of map() functions which allow you to replace many for loops with code that is both more succinct and easier to read.” (purrr 1.0.2 Reference)
Indeed, we will apply the map() function to use a
programming tool or strategy called iteration. Iteration
helps when you want to apply the same function to multiple objects,
while avoiding duplicated code and reducing time (i.e., doing it
efficiently). It means you can replace several lines of duplicated code
with a single one and perform the same function across multiple
objects. Recall: R objects include vectors, data sets, and lists.
Does iteration sound similar to something you do in Excel? Hint: think about dragging a formula.
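A minimal base-R sketch of the idea, with toy numbers: the repeated calls and the single iterated call produce the same result.

```r
# Without iteration: the same function repeated line by line
round(3.14159, digits = 2)
round(2.71828, digits = 2)

# With iteration: one line applies round() to every element
vals <- c(3.14159, 2.71828, 1.41421)
sapply(vals, round, digits = 2)  # 3.14 2.72 1.41
```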
Now, the map function requires two arguments. First, we
need to specify the inputs; that is, a list of elements. The second
argument of map is simply the function we are applying to the list. For
example, check the following image:
The list consists of the databases separated into sheets in
“datasets.xlsx”. To access them, we use
excel_sheets again.
sheet_names <- excel_sheets(xlsx_example)
sheet_names
## [1] "iris" "mtcars" "chickwts" "quakes"
But we add a nuance here: the set_names
function, which assigns names to the elements of the output that map
will create. By default, the names are the original ones in “datasets.xlsx”,
but we can also assign different names to the sheets.
set_names(sheet_names, c("data_1", "data_2", "data_3", "data_4"))
## data_1 data_2 data_3 data_4
## "iris" "mtcars" "chickwts" "quakes"
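If you prefer to stay in base R, setNames() does the same job; set_names() is the tidyverse variant re-exported by purrr. A quick comparison:

```r
# Base-R equivalent of set_names(): setNames(values, names)
named_sheets <- setNames(c("iris", "mtcars", "chickwts", "quakes"),
                         c("data_1", "data_2", "data_3", "data_4"))
named_sheets
```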
So, the previous line of code with two functions – set_names and sheet_names – provides our inputs. We can save it.
input_sheets <- set_names(sheet_names, c("data_1", "data_2", "data_3", "data_4"))
Meanwhile, the function we will apply to each element within
input_sheets is read_excel, via map().
Notice that path is the route to the file. We
will use this argument (path) later on.
list_dataframes <- map(input_sheets, # input
                       read_excel, path = xlsx_example) # function and path
So, list_dataframes is an object containing our four dataframes, named “data_1”, “data_2”, “data_3”, and “data_4”.
list_dataframes
## $data_1
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ℹ 140 more rows
##
## $data_2
## # A tibble: 32 × 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## # ℹ 22 more rows
##
## $data_3
## # A tibble: 71 × 2
## weight feed
## <dbl> <chr>
## 1 179 horsebean
## 2 160 horsebean
## 3 136 horsebean
## 4 227 horsebean
## 5 217 horsebean
## 6 168 horsebean
## 7 108 horsebean
## 8 124 horsebean
## 9 143 horsebean
## 10 140 horsebean
## # ℹ 61 more rows
##
## $data_4
## # A tibble: 1,000 × 5
## lat long depth mag stations
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -20.4 182. 562 4.8 41
## 2 -20.6 181. 650 4.2 15
## 3 -26 184. 42 5.4 43
## 4 -18.0 182. 626 4.1 19
## 5 -20.4 182. 649 4 11
## 6 -19.7 184. 195 4 12
## 7 -11.7 166. 82 4.8 43
## 8 -28.1 182. 194 4.4 15
## 9 -28.7 182. 211 4.7 35
## 10 -17.5 180. 622 4.3 19
## # ℹ 990 more rows
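Before separating them, note that the elements of a named list can already be accessed individually by name. A toy illustration (hypothetical names data_x and data_y, built from the base chickwts and mtcars data sets):

```r
# Accessing elements of a named list by name
toy_list <- list(data_x = head(chickwts), data_y = head(mtcars))
toy_list$data_x       # with $
toy_list[["data_y"]]  # with [[ ]]
```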
Now, we want to split the list and save each database separately
in the Global Environment. The base R list2env function is
very useful for this purpose.
list2env(list_dataframes, envir = .GlobalEnv) #save in Global Environment
## <environment: R_GlobalEnv>
Done! We can now work with the four different data frames.
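As a toy illustration of what list2env() just did (hypothetical objects a and b):

```r
# Each element of the named list becomes a free-standing object
# in the chosen environment
demo_list <- list(a = 1:3, b = letters[1:2])
list2env(demo_list, envir = environment())
a  # now available directly as an object
```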
Let’s put your new skills to work. The “sample.xlsx” data set is located in the following link. We want you to perform three tasks:
First, check the sheet names and save them into the sheet_names2. Hint: there are two sheets.
Second, change the sheet names to data_a and data_b. Save them into the input_sheets2 object.
Finally, read and save each data set into the Global Environment.
Hint: use path to specify the location of “sample.xlsx” in
your computer.
Let’s keep using “sample.xlsx”. Imagine you only want to keep the
first 20 rows in bank-full. You can work with data_a
in our Global Environment. Alternatively, you can do it directly while
reading the data using n_max.
read_excel(path = "/Users/danielencinas/Downloads/sample.xlsx",
sheet = 1,
n_max = 20)
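If the data are already loaded, the same result can be had in base R with head() (a sketch with toy data standing in for data_a):

```r
# Keeping only the first 20 rows of a data frame already in memory
df <- data.frame(x = 1:100, y = rnorm(100))
head(df, n = 20)  # first 20 rows
```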
To remove the first 10,000 rows, we can use the skip argument.
The problem is that skip also removes the column names. So, we will use the
col_names argument.
First, we need to store the column names from data_a. To get data_a back, let us repeat the three steps above.
getwd()
## [1] "/Users/danielencinas"
setwd("/Users/danielencinas/Downloads")
sheet_names2 <- excel_sheets(path="sample.xlsx")
sheet_names2
## [1] "bank-full" "iris"
input_sheets2 <- set_names(sheet_names2, c("data_a", "data_b"))
map(input_sheets2, read_excel, path = "/Users/danielencinas/Downloads/sample.xlsx") %>%
list2env(envir = .GlobalEnv)
## <environment: R_GlobalEnv>
Now, we can work with data_a.
columns_a <- colnames(data_a)
columns_a
## [1] "age" "job" "marital" "education" "default" "balance"
## [7] "housing" "loan" "contact" "day" "month" "duration"
## [13] "campaign" "pdays" "previous" "poutcome" "y"
Then, we specify the column names in the col_names
argument.
read_excel(path = "/Users/danielencinas/Downloads/sample.xlsx",
sheet = 1,
skip = 10000,
col_names =columns_a)
To specify the column types, we use the col_types
argument. But notice that we can also effectively remove variables by
setting a column’s type to “skip”.
read_excel(
path = "/Users/danielencinas/Downloads/sample.xlsx",
sheet = 1,
skip = 10000,
col_types = c(
"text",
"text",
"text",
"text",
"text",
"numeric",
"skip",
"skip",
"skip",
"skip",
"skip",
"skip",
"skip",
"skip",
"skip",
"skip",
"skip"
)
)
For instance, the previous code only keeps 6 out of 17 variables.
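The long type vector can be built more compactly with base rep() (the same 17 entries as above):

```r
# rep() avoids typing "skip" eleven times
col_types_compact <- c(rep("text", 5), "numeric", rep("skip", 11))
length(col_types_compact)  # 17
```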
Now, you can also select only a range of rows, columns, or cells (i.e., the intersection between rows and columns). First, let us start by selecting rows.
read_excel(path = "/Users/danielencinas/Downloads/sample.xlsx",
sheet = 1,
range = cell_rows(1:10000))
Similarly, we can select only the first six columns.
read_excel(path = "/Users/danielencinas/Downloads/sample.xlsx",
sheet = 1,
range = cell_cols(1:6))
Finally, we can select cells. Recall that Excel labels columns with letters and rows with numbers.
read_excel(path = "/Users/danielencinas/Downloads/sample.xlsx",
sheet = 1,
range = "A1:F45211")
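Once the data are in memory, the three range selections above have base-R analogues via bracket indexing (toy data):

```r
# Bracket indexing mirrors cell_rows, cell_cols, and a rectangular range
df <- as.data.frame(matrix(1:60, nrow = 10))  # 10 rows, 6 columns
df[1:5, ]     # rows 1-5, all columns  (like cell_rows)
df[, 1:3]     # all rows, columns 1-3  (like cell_cols)
df[1:5, 1:3]  # a rectangular block    (like range = "A1:C5")
```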
The read_excel function supports both .xlsx and .xls
files. What about .csv files (i.e., comma-delimited files)? We use the
read_csv function or read.csv. But let’s first
write some .csv files.
First, we can save data_1 as iris.csv.
write.csv(data_1, file = "iris.csv")
Alternatively, you can save the data as an .xlsx document.
write_xlsx(data_1, "iris.xlsx")
Save data_2 using its original name. Also, save it as .xls rather than .xlsx. Hint: notice above the way to specify the file type.
We can also write multiple data frames. First, let us save a vector with the names of our data frames.
list_dataframes2 <- c("data_1", "data_2", "data_3", "data_4")
We can now write a function to rename and save .csv files on our computer.
write_csv_list <- function(df_list) { # The function starts here
  for (i in seq_along(df_list)) { # A for loop starts here
    name <- df_list[i] # Extract the name of the data frame
    path <- paste0("/Users/danielencinas/Downloads/new_", name, ".csv") # Change accordingly
    write.csv(get(name), file = path) # Retrieve the data frame by its name and export as csv
  } # Here ends the loop
} # The end of the function
Let us understand the function bit by bit. First, we extract the name of each data frame by indexing (df_list[i]). For instance, you can find the second name. In the function above, we save this output as name.
list_dataframes2[2]
## [1] "data_2"
We also specify the path argument in our function. We
use paste0 to create a pattern with the specific path as well as the
name we are using. For instance, it works as follows with data_1:
paste0("/Users/danielencinas/Downloads/new_", "data_1",".csv")
## [1] "/Users/danielencinas/Downloads/new_data_1.csv"
Finally, we apply the function to our vector of names.
write_csv_list(list_dataframes2)
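An alternative to the for loop, sketched in base R with Map() over an actual list of data frames (toy data built from base data sets, written to tempdir()):

```r
# Map() pairs each data frame with its name; because the data frames
# themselves live in the list, no lookup by name is needed
dfs <- list(data_1 = head(chickwts), data_2 = head(mtcars))
invisible(Map(function(df, nm) {
  write.csv(df, file.path(tempdir(), paste0("new_", nm, ".csv")), row.names = FALSE)
}, dfs, names(dfs)))
list.files(tempdir(), pattern = "^new_")
```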
Sources:
Northwestern University